=============== Fastcall macro quick guide, for BETA1. ===================

1. How it works?

      Fastcall is a set of macros for fasmg (developed at fasm2 environment) used to call functions in a "horizontal" manner, by following what is defined at System V ABI specification. It is enhanced to not only be C-style looking macro (it is similar, not equal), but to generate assembly code that is very close to what C compilers generate when calling functions. Because C is very good at calling functions; by this aspect, it is almost impossible even for an experienced in assembly language human programmer to beat C generated code (a draw is not accounted under this quote). So, I keep the most benefits I've learned from debugging C code implemented in this macro, as a collection of decisions based on parameter types and sizes and ordering. Hence the countless if blocks inside its code. ;) This, of course, comes at a cost: this is not so far the quickiest macro set ever: it hurts a little the fasmg compiler performance, in exchange to provide the best generated code it can. Maybe latter deployments can level up this issue.
      Its usage is enabled by defining a prototype earlier using 'proto' macro. Then, the proto macro will create a lot of hints for every parameter of the prototyped function, so fastcall can parse them when used, setting proper sizes and types of parameters in the right place. Also, proto macro generates a macro with the name of the function for every function, so, no need to 'fastcall function, args': you just 'function(args);' and that's it.

2. Important concepts

      1. Trashed registers

      This is the concept number 0 in understanding what you CANNOT do when using fastcall macro. In this case, avoid using the trashed register after is has been trashed. The order of trashed registers, are in reverse order, and the overall sequence of classes is also reverse ordering as follows:

           - memory integer class parameters will be pushed onto stack, and r9 is the auxiliary register for anything that needs a register to be prepared (e.g., an effective addressor 64-bit sized immediate number);
           - memory float point class parameters can be mixed with the above, following general prototyped RTL ordering, and they don't use any GP register at all; data is created as float point data in the first defined virtual location¹ found and then pushed as normal parameters;
           - exception is tword (or long double or l_double, all synonyms) data types, they are allocated at stack, aligned, and uses FPU register st0 for moving the 10-byte data, and that data might be also created in the first virtual location¹ as above;
           - then, comes xmm class parameters, they're created as data if immediate FP numbers, and/or passed directly to current xmm register otherwise;
           - then, comes register class parameters, in reverse order, passed onto corresponding position register.

        All of the above might be mixed, depending on the RTL sequence of the prototype for the called function. Important to remember is:

           Register class RTL order: r9, r8, rcx, rdx, rsi, rdi
           XMM class RTL order: xmm7, xmm6, xmm5, xmm4, xmm3, xmm2, xmm1, xmm0

        This shows the order of trashed registers, so for the above, the leftmost, when used by the prototype, is always the most volatile. This is why I choose r9 for being the only auxiliary register; so all others are safe until it comes its time in the RTL queue. And r9 is not used if there are no memory class tyep integer parameters to be passed on. Trashed registers also depends on how many register parameters the function has: if less, the volatile level will be displaced to the right of the above sequece. Example:

            proto func, qword, qword, dword, double, double  ; 3 integer parameters (register), 2 float point parameters (xmm)

            fastcall writing sequence: xmm1, xmm0, edx, rsi, rdi
            Trashed order (GP registers): rdx, rsi, rdi     ; rcx, r8 and r9 are immune here
            Trashed order (XMM registers): xmm1, xmm0       ; xmm7 to xmm2 are immune here



            proto func2, double, dword, double, double      ; 1 integer, 3 float

            fastcall writing sequence: xmm2, xmm1, edi, xmm0
            Trashed GP register order: rdi                  ; rsi, rdx, rcx, r8 and r9 are immune
            Trashed XMM register order: xmm2, xmm1, xmm0    ; xmm7 to xmm3 are immune

            In the case of func2(), no GP register is vulnerable, because rdi is already the last parameter of its kind.

        The prototype creates the definition in forward mode (LTR), so, one can find the current affected register at any parameter:

            proto x, dword, dword, dword, dword, dword, dword   ; 6 integer parameters
            proto y, double, double, double, double, double, double, double, double ; 8 float point parameters
            proto z, dword, double, dword, double, dword, double    ; mixed 3 x 3 integer x float point parameters
            proto j, qword, tword, qword, dword ; 1 long double mixed with 3 integers
            proto k, qword, qword, double, qword, qword, dword, dword, double, dword, dword ; mixed parameters
            proto l, qword, dword, qword, vararg    ; 3 integers and then vararg
            proto m, none   ; no parameters

            x => edi, esi, edx, ecx, r8d, r9d       ; x() => 6th, 5th, 4th, 3rd, 2nd, 1st
            y => xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7   ; y() => 8th, 7th, 6th, 5th, 4th, 3rd, 2nd, 1st
            z => edi, xmm0, esi, xmm1, edx, xmm2    ; z() => 6th, 5th, 4th, 3rd, 2nd, 1st
            j => rdi, [memory], rsi, edx            ; j() => 4th, [3rd], 2nd, 1st
            k => rdi, rsi, xmm0, rdx, rcx, r8d, r9d, xmm1, [memory], [memory] ; [10],[9],8,7,6,5,4,3,2,1
            l => rdi, esi, rdx, ...     ; ..., 3rd, 2nd, 1st, (hidden) al -> xmm count
            m =>    ; no register or memory affected

        For better understanding, it is recommended to read the System V ABI specification.
        In the prototype sequence, the rightmost are the first ones to be delivered, and then, they're also modified prior to the others; this makes rdi and xmm0 the safest registers to be used; they will survive until it is its time to be delivered. It means that one should not use, for example, rcx as the second parameter, because rcx will be modified first by the RTL order, then the second parameter will receive the 4th parameter data as well, because rcx is the 4th parameter destination, and in the RTL order is written first. Therefore, contents or rcx are now lost/replaced by 4th parameter.
        It is always better understandable if one is writing code and using a debugger to see its output. I must recommend edb-debugger for anyone using Linux. It's the birds-eye view of the program and the context aboard processor and memory.

        There's also an important rule when passing anything that is or uses rsp register: memory class parameters are passed onto stack, and they're usually passed first in RTL ordering: one must be cautious for calculating the exact rsp value at the given parameter rsp takes part. If rsp is passed after a memory parameter is passed RTL, its value has already changed, and even alignment calculation might had been modified rsp already! For this specification, Systm V ABI also can clarify better than I. Any uncertainty should be checked with a debugger if the expected behavior occured.

        An important rule to remember for the fascall when having memory class parameters is its sequence of events:

         - calculates and applies overall alignment at rsp;
         - passes parameters RTL: memory will be pushed, register will be passed, starting from the last;
         - if vararg type function, passes 'al' = number of xmm registers used;
         - calls or jumps function;
         - if call type, deallocates previous allocated stack space.

        And without memory class:

         - passes register (GP and xmm) parameters RTL as above;
         - if vararg type, 'al' register will have the number of xmm registers used;
         - calls or jumps function.

        Also, an important thing to be known by the user: the default method for creating imported functions entries in the final program is by macros built-in inside fastcall: it is copy-based on 'import64.inc' and 'elfsym.inc' files that came as an example in fasmg package, which generates the cleanest executable possible (also the smallest), free from all of the unneeded bloat common external linkers (like 'ld' or 'gcc') append to your file. One intended to use standard .o object model externally linked should modify the macro to use 'extrn' directive instead. But I had not tested this approach in any time during developing: maybe I'll fully support it properly in later revisions of this macro collection. With which is already included, you'll get the final ELF64 executable by just adding 'include fastcall.inc' as it is now.

        2. parameter formatting

        Fastcall parameters follow the following rules:

            &location  -> passes effective address of location (will use r9 to be prepared if memory class)
            *location  -> passes data at location
            [location] -> also passes data at location
            34567      -> pass 32 bit number
            9999999999 -> pass 64 bit number (will use r9 if integer memory class)
            (23*4+FIX) -> pass result as a number, 32 or 64 bit
            +3245      -> pass signed number
            -45637     -> pass signed number
            rax        -> pass register as integer parameter
            xmm5       -> pass xmm register as xmm class parameter or memory class parameter
            &rax+rcx*2+14 -> pass effective address of complex addressing parameter
            123.4567   -> pass float point parameter as (if prototyped) xmm or memory class
            **location -> gets the pointer at [location], and then pass the data located at this pointer
            "stringz"  -> creates "string",0 as db data, then passes a pointer to string data
            <"string"> -> creates literal "string" as db data (no null termination), and passes its pointer
            <"string",10,0> creates "string",10,0 as db data and passes its pointer

         Float point and string datas are created as data in the first virtual location¹.
         "string" simple string is always create with null termination, whereas <"string"> is created literally as "string" db data, without any addition.

         'proto' macro accepts the following types for parameters:

            byte    -> 1 byte, byte integer parameter
            word    -> 2 bytes, word integer parameter
            dword   -> 4 bytes, dword integer parameter
            qword   -> 8 bytes, qword integer parameter (default for vararg)
            tword   -> 10 bytes, tword float point parameter (known as long double)
            float   -> 32-bit single precision float point parameter
            double  -> 64-bit double precision float point parameter (default for vararg)
            vararg  -> from this and beyond, function is variable number of arguments
            none    -> function has no parameters

        'vararg' parameters has a best guess method for passing them in the most optized and coherent way, but every vararg parameter can be overriden by size prefixes below.

        Also, are available for fastcall size override prefixes:

            db param    -> pass param as forced byte size
            dw param    -> pass param as forced word size
            dd param    -> pass param as forced dword size
            dq param    -> pass param as forced qword size
            dt param    -> pass param as forced tword size
            df param    -> pass param as forced 32 bit float point size
            dp param    -> pass param as forced 64 bit float point size

        'param' can be any of the fastcall supported parameters, be careful!
        When using size override, it can override even the prototyped sizes, and results can be unpredictable. Recommended usage is only as a hint for vararg types.
        One can also use, for example, ' dd &eax+esi*4+22 ' to pass pamameter as 32-bit computed "address", which is supposed to be "high level equation" in runtime only passing as parameter.

3. Auxiliary macros

        There are the following macros used to support fastcall and proto macros:

            _rdata  -> creates a read-only data segment, and the most favorite virtual location¹ for anonymous data to be placed;
                    Your read-only data might be placed here.

            _data   -> creates a read-write data segment, and a virtual location¹ for anonymous data to be placed;
                    Your read/write data might be placed here.

            _code   -> create read execute code segment, the least favorite virtual location¹ for anonymous data to be placed;
                    Your code will probably be placed here.

            library -> defines the external libraries needed for the application, in quoted strings each;
                    Can be used multiple times

            extern  -> defines external functions needed for the application, in a sequence of named tokens;
                    This can be used only once!

            ext      -> proto prefix, indicates that this function should be prepared to be used from 'extern';
            noreturn -> proto prefix, prototypes the function as jmp function instead of call function;
            indirect -> proto prefix, prototypes function to be called indirectly (i.e., call [function]);

        Any of the following combinations are valid for proto macro:

            ext proto func, params                      ; external function, direct call func
            ext noreturn proto func, params             ; external function, direct jump func
            proto func, params                          ; internal function, direct call _func ²
            noreturn proto func, params                 ; internal function, direct jump _func ²
            ext indirect proto func, params             ; external function, indirect call [func]

            ext noreturn indirect proto func, params    ;
            ext indirect noreturn proto func, params    ; external function, indirect jump [func]

            indirect proto func, params                 ; internal function, indirect call [_ptr_func] ²

            noreturn indirect proto func, params        ;
            indirect noreturn proto func, params        ; internal function, indirect jump [_ptr_func] ²

        ² For internal function, there is a limitation so far: due to fasm not having backward reference support (or forward reference), 'macro func()' cannot 'fascall func', because they have same names. So, internal called functions shoud be prefixed as '_function:' for direct call/jmp, or the variable holding the function pointer should the prefixed as '_ptr_func'. The underline '_' is the differentiator:

             proto indirect pfunc1, qword
             proto func2, qword

             _pfunc1  dq func1
             ; ...
             func1: ; do stuff
                    ret

            _func2: ; do stuff
                    ret

            pfunc1(param); This calls [_pfunc1] pointer
            func2(param);  This calls _func2 directly


        The external functions' namespace is '__imp', and the default jump point for direct calls to external functions (located at namespace __imp too) is prefixed as 'proto', resulting in 'proto.function' external jump point name. This is similar to what PLT does, but it is not PLT! PLT is loaded on demand by the linker, this has everything resolved by dynamic linker at startup.
        'ext' defines one of two things: if direct type, it creates a jmp [__imp.function] entry at a separate code segment; if indirect type, it just append function name to a "list" of functions that can be parsed with 'irpv' directive (to be implemented, not working so far by still unknown reason).
        So, if your functions are called 'g_print', 'snprintf', 'exit' and 'getpid', for example, the whole thing will be:

            library 'libc.so.6', 'libglib-2.0.so.0' ; the needed libraries where functions are located

            extern g_print, snprintf, exit, getpid  ; External functions, all of them at once

            ; Functions' prototypes to be used by fastcall
            ext proto snprintf, qword, dword, qword, vararg
            ext noreturn proto exit, dword
            ext proto getpid, none
            ext indirect proto g_print, qword, vararg

            ; ...
            ; Your code below will fastcall functions as follows:

            snprintf(&buffer, 512, "Format anonymous string %s %u", &complement, r10d);
            ; Resolve to:
            ;    mov r8d, r10d
            ;    lea rcx, [complement]
            ;    lea rdx, [..anonymous_data_ptr]
            ;    mov esi, 512
            ;    lea rdi, [buffer]
            ;    xor al, al ; This is hidden vararg parameter from System V ABI, see the manual
            ;    call proto.snprintf        ; This is direct call to jump table
            ;    ; ...
            ;    ; Below is in the jump table:
            ;    proto.printf: jmp [__imp.snprintf]

            g_print(&buffer+64);
            ; Resolve to:
            ;    lea rdi, [buffer+64]
            ;    xor al, al
            ;    call [__imp.g_print]       ; This is indirect call to [__imp.g_print] pointer variable
            ; ...
            ;    ; No jump table for this one

            getpid();
            ; Resolve to:
            ;    call proto.getpid
            ;    ;...
            ;    ; Jump table:
            ;    proto.getpid: jmp [__imp.getpid]

            ; This example with dt size override at 2nd, and default double for 3rd:
            g_print("Long double is: %Lf, double is %lf",10,0>, dt 1234.567890, 987.654321);
            ; Resolve to:
            ;    movsd xmm0, [..anonymous_data_ptr] ; This ..anonymous_data_ptr is redefined every time a new
            ;    fld tword [..anonymous_data_ptr]   ; anonymous data is created at virtual location¹
            ;    fstp tword [rsp-16]
            ;    sub rsp, 16    ; alocate needed stack
            ;    lea rdi, [..anonymous_data_ptr]
            ;    mov al, 1  ; Hidden parameter: 1 xmm register used.
            ;    call [__imp.g_print]
            ;    add rsp, 16    ; release stack back
            ;


            exit(0);
            ; Resolve to:
            ;    sub rsp, 8     ; fastcall automatically allocates, aligns and releases stack for the function
            ;    xor eax, eax   ; best way of passing 0 as data
            ;    jmp proto.exit
            ;    ; ...
            ;    ; Jump table:
            ;    proto.exit: jmp [__imp.exit]

4. Tips, feedback and suggestions

      Any feedback will be appreciated if you're using this, and should preferably be posted in the topic: https://board.flatassembler.net/topic.php?t=23838, where I had announced this macro. Maybe more experienced people there can help faster than I can, and I also be there very often to help whenever I can.
      For bugs, or any failures that occured, the post should contain the code causing the problem and the output error (in any) of the fasmg/fasm2 assembler for better handling off the problem.
      For suggestions, also an usage example might be required, depending of what kind of suggestion it is.
      This is, as I said there, more an idea than a final undivisible/unchangeable concept: anyone can modify this and enhance it for it to fit one's purposes, and for benefit for others too. Updates on it can be freely posted at the forum, or even e-mailed me at: jesse6.2023@proton.me, also suggestions and feedback as well.
      As a final word, I must alert one using this that I made this macros on demand, so I only implemented what I could and can use and test from System V ABI specification: it has more features that, by myself not finding yet usage for it, I have no way to test and implement them properly. If anyone finds something missing, please send example code that uses it so I can learn from your example and implement or repair it.






